ABSTRACT 

In response to growing frustration over the lack of 
information about the national effectiveness of the Chapter 1 
program, Congress enacted the Education Amendments of 1974. Section 
151 of the Amendments directed the U.S. Office of Education to 
develop evaluation models that would allow school district data to be 
aggregated to provide national estimates of program effectiveness. 
The norm-referenced model was the most easily applied of the 
alternatives developed. This model substitutes test norms for a 
traditional comparison group. Posttest standing relative to the norm 
group is compared with pretest standing relative to the norm group. 
The 1975 document, "A Practical Guide to Measuring Project Impact on 
Student Achievement," specified the conditions in which the 
norm-referenced model could be used. Several difficulties have arisen 
in implementing the models, but school districts today are still 
required to evaluate their Chapter 1 projects. Requirements enacted 
in 1988 mean that districts essentially must use nationally normed 
tests or tests equated to nationally normed tests to measure student 
achievement in both basic and more advanced skills. Test norms and 
norm-referenced tests are reviewed, with attention to measurement 
error, the effects of high-stakes testing, and the relevance of 
national norms as a comparison group. Ways in which information on 
program effectiveness could be better provided are discussed. Five 
figures and one table illustrate the discussion. (SLD) 
Development of the Title I (Chapter 1) Evaluation and Reporting System 

Title I of the Elementary and Secondary Education Act of 1965 was enacted to provide special
educational assistance to students in areas impacted by poverty. The law mandated that
local school districts report annually on program effectiveness, but did not require specific
reporting formats or standards. Subsequent attempts to estimate the effectiveness of the
program nationwide were less than overwhelmingly successful. (See Hope Associates,
1979, for a review of the early history of Title I. Title I is now Chapter 1 of the Education
Consolidation and Improvement Act, or ECIA.) In response to growing frustration over the
lack of information on the national effectiveness of the program, Congress enacted the
Education Amendments of 1974. Section 151 of the Amendments directed the U.S. Office
of Education (USOE) to develop evaluation models which would allow district data to be
aggregated to provide national estimates of program effectiveness.

USOE responded by contracting with the RMC Research Corporation to develop evalua- 
tion models which would provide a national estimate of the effectiveness of Title I, and, in 
1975, USOE released "A Practical Guide to Measuring Project Impact on Student Achieve- 
ment" (Horst, Tallmadge, and Wood, 1975), which provided Title I projects with assess- 
ment methods which allowed evaluation results to be aggregated across projects 
nationwide. The "Practical Guide" provided five evaluation models for districts to consider 
when measuring project impact: 

• A posttest comparison using matched groups; 

• Analysis of covariance; 

• Special regression (either regression projection or regression 
discontinuity models); 

• A general regression model; and 

• The norm-referenced model. 

All models were designed with the goal of providing "as clear and unambiguous an answer
as possible" to the question "How much more did pupils learn by participating in the project
than they would have learned without it?" (Horst, Tallmadge, and Wood, 1975, p. 1)

Not all of the five models in the "Practical Guide" turned out to be practical for a majority 
of school districts, and by 1976 the five models had been reduced to three: a comparison 
group model, a special regression model, and the norm-referenced model. Each of these 
models existed in two forms, one a norm-referenced version, the other a non-normed version.
(See Tallmadge and Wood, 1976.) There were still problems. Most school personnel
in 1976 lacked either the expertise or the computers to easily implement the regression
model, and the comparison group model was more often than not impossible to implement
due to program rules which specified that the most needy schools and students were to
receive services. This left, by default, the norm-referenced model.



The Requirements 



The norm-referenced model substitutes test norms for a traditional comparison group.
Essentially, it is assumed that in the absence of a treatment [...]. If a treatment had
been provided, any change in percentile standing can be attributed to the treatment. This
assumption is com[...]

[...] strengths of the model that:

"Where no comparison group is available, the norm group provides a 
plausible estimate of no-treatment posttest scores. Even where a comparison
group is available, unless it comes from the same population as the treatment
group, the Norm-Referenced Model offers a more defensible estimate of
posttest performance at substantially less cost and effort than a comparison-group
model."

The weakness was that:

"The validity of the model rests on the assumption that the achievement
status of a particular subgroup remains constant relative to the norm group
over the pre- to posttest interval if no special treatment is provided. Empirical
support for this assumption is minimal. It is conceivable that some subgroups
would move up and others move down in the normal course of events.
When the norm group is like the treatment group, the plausibility of the
underlying assumption is greatly enhanced; thus, for example, norms for gifted
[...]" (Horst et al., 1975)

[...] model selection gave the clear impression that the norm-referenced
was a model of last resort, with a decision tree that led one to the model only after
[...] as it was noted that "Despite the clear order of preference [...] considerations
may well counterbalance scien[...] implemented Model A [the norm-referenced
model] will yield more credible results than a poorly implemented Model B [the comparison
group model] or C [the special regression model]." (Tallmadge and Wood, 1976, [...])

[...] model were informed that they [...] posttesting [...] and posttesting on the
norming dates used by the test publisher. A t-test was used to compare
students' actual average posttest score (using standard scores, since normal curve
equivalents, or NCEs, had not yet been added to the system) with the expected posttest
score (based on the pretest mean), and the resulting t value was tested for statistical
significance at the .05 level. If the difference was statistically significant, one next checked for
educational significance, with a general rule of thumb being to determine whether the
observed posttest scores exceeded the expected posttest scores by a third of a standard
deviation.
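
The two-step check described above (statistical significance, then educational significance) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure as specified in the guide: the function name, the sample scores, the expected posttest value, the reference standard deviation, and the critical t value are all hypothetical, and the comparison is written as a one-sample t-test of the observed posttest mean against the expected score.

```python
import math
import statistics


def evaluate_gain(posttest_scores, expected_posttest, reference_sd, critical_t):
    """Return (t_value, statistically_significant, educationally_significant).

    Hypothetical helper: a one-sample t-test of the observed posttest mean
    against the expected (no-treatment) posttest score, followed by the
    one-third standard deviation rule of thumb.
    """
    n = len(posttest_scores)
    observed_mean = statistics.mean(posttest_scores)
    # Sample standard deviation (n - 1 in the denominator).
    sample_sd = statistics.stdev(posttest_scores)
    standard_error = sample_sd / math.sqrt(n)
    t_value = (observed_mean - expected_posttest) / standard_error

    # critical_t must be looked up for n - 1 degrees of freedom at the .05 level.
    statistically_significant = t_value > critical_t
    # Rule of thumb: the observed mean exceeds the expected score by at least
    # a third of a standard deviation (which standard deviation is an assumption).
    educationally_significant = (
        statistically_significant
        and (observed_mean - expected_posttest) >= reference_sd / 3
    )
    return t_value, statistically_significant, educationally_significant


# Made-up standard scores for a 10-student project.
scores = [452, 460, 447, 471, 455, 463, 449, 458, 466, 459]
t, stat_sig, educ_sig = evaluate_gain(
    scores,
    expected_posttest=450.0,  # assumed no-treatment expectation
    reference_sd=21.0,        # assumed standard deviation of the score scale
    critical_t=2.262,         # two-tailed .05 critical value for 9 degrees of freedom
)
print(round(t, 2), stat_sig, educ_sig)
```

The critical value is passed in rather than computed so that the sketch needs only the standard library; whether a one-tailed or two-tailed value was intended is not stated in the text.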

Users were allowed to use tests without national norms (e.g., district or state tests) if the
tests were equated to nationally normed tests. The inclusion of an option for non-normed
tests was perhaps practically and politically necessary: while the movement against standardized
testing was less well-defined 15 years ago than today, critics existed. The typical
alternative of the day, however, was the district or state criterion-referenced test, and in order
to use these tests, districts were required to equate them to tests with national norms. For
the most part, this proved difficult, either because of the costs involved in the additional
testing required or because the tests measured different types of skills and equating was
impossible. After all, if the locally developed test was so similar to the nationally normed test
that there was a very high correlation between the scores on the two tests, the odds were
that the tests were similar enough that it was easier to just use the standardized test, and
most districts did just that.

Districts were allowed to assess student achievement using either a fall-to-spring testing 
cycle or an annual cycle (e.g., spring to spring or fall to fall), and they were cautioned to: 

(1) use the same level and form of the test at pretest and posttest, 

(2) use functional level testing, 

(3) test within 2 weeks of the publisher's norming dates, and 

(4) select students for the program on some other measure than their pretest 
scores. 

In the 1976 guide, raw scores were converted to NCEs, and testing for statistical
significance of the gains was eliminated. By 1981, the revised "Evaluator's Reference"
proclaimed that "All NCE gains greater than zero are good!"
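
For readers unfamiliar with the metric, an NCE is a normalized standard score with a mean of 50 and a standard deviation of about 21.06, chosen so that NCEs of 1, 50, and 99 coincide with the same percentile ranks. A minimal sketch of the percentile-to-NCE conversion follows; the function name is hypothetical, and in practice conversions are read from publishers' norms tables rather than computed this way.

```python
from statistics import NormalDist


def percentile_to_nce(percentile_rank):
    """Convert a percentile rank (strictly between 0 and 100) to an NCE."""
    # z-score of the percentile under the standard normal distribution.
    z = NormalDist().inv_cdf(percentile_rank / 100)
    # Rescale to mean 50, standard deviation 21.06.
    return 50 + 21.06 * z


for pr in (1, 9, 11, 50, 99):
    print(pr, round(percentile_to_nce(pr)))
```

With this formula the 9th and 11th percentiles correspond to NCEs of about 22 and 24, so equal percentile steps are not equal NCE steps away from the middle of the distribution.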

After the models were developed, USOE began working with State and local education 
agencies to implement them, and funded 10 Technical Assistance Centers (TACs) to help 
school districts with their Chapter 1 evaluations. The 1979-80 school year was the first year 
of full district implementation of the models. 



The Reaction 



Complaints about the models began immediately. A survey by Hope Associates (1979)
found complaints from State education agencies, Technical Assistance Centers, and some
USOE staff that evaluations were not useful for school districts, that the requirements
excluded "process" evaluation, and that the Title I "evaluation" system was really a Title I
"reporting" system. Congressional aides, on the other hand, felt that they were developing a
system that would be useful for SEAs and LEAs.

USOE staff at the time (of which I was one) noted in response that they were required to
develop a system that would provide a means of aggregating district information to provide
a national estimate of program effect, and that districts were not precluded from supplementing
the basic system with more information. In addition, USOE routinely challenged
critics to propose alternative systems which would meet the basic requirement of
providing a means of aggregating data to provide national estimates of program effectiveness.
While there was interest in making the evaluations as useful as possible for school
district personnel, there was a definite feeling among many of the evaluators involved that
the data would have limited utility for very small projects because of the measurement error
associated with the model and the appropriateness of comparison to national norms. While
many of these errors would balance out when the data were combined at the national level,
school personnel were advised to base program decisions on a variety of sources of
evidence and not to make judgments about either pupils or programs based on one test
score, or, for that matter, on any other single piece of evidence.

By 1981 Model A was enough a part of the evaluation landscape that Linn (1981) included
[...] on measuring pretest-posttest performance changes. He noted that
with Model A "[t]here is a certain intuitive appeal to the assumption that in the absence of
special interventions students will tend to maintain the same relative standing in achievement.
Intuitive appeal is not a very adequate basis for an evaluation assumption, however.
Furthermore, even if the assumption were justified under idealized conditions, it still would
lead to biased estimates of project effects as it is implemented in practice." He also noted
that "[t]he model rests on a strong assumption for which there is no adequate basis" (i.e.,
the equi-percentile assumption) and provided examples of both factors that could lead to
positive bias (e.g., regression effects) and factors which could lead to negative bias, including
dissimilarity of the project group and the group on which the test was normed: "By their
very nature, special programs are often designed for groups that are quite dissimilar to the
usual norming sample. Due to this dissimilarity, the expectations based on the norms may
be too high or too low." Linn concluded (p. 95) that "[i]n summary, the constant NCE approach
to defining normal growth and thereby providing a means for project evaluation
lacks an adequate justification. In some applications, it may be seriously defective and
result in biased estimates of the effects of an educational program."
Other authors reanalyzed existing data bases to determine the actual growth of compensatory
education students over time and, thus, the validity of the equi-percentile assumption.
Kaskowitz and Norwood (1977) produced a USOE sponsored review of the utility of the
norm-referenced model for evaluating Project Information Packages (PIPs). (PIPs were
specific projects which were being disseminated by USOE for possible adoption by other
sites.) They noted that evidence from the first year PIP evaluation indicated that compensatory
education students may be the students who lose ground over time, relative to the
norm population, and they analyzed Metropolitan Achievement Test (MAT) data for
several different groups of students to attempt to determine whether that was indeed the
case. They noted that the MAT "empirical growth curves do indicate that the equipercentile
growth curve tends to underestimate expected posttest scores for extremely low pretest
scores and tends to overestimate expected posttest scores for extremely high pretest scores
across grade levels and tests." When they examined actual performance for a group of students
who were not in the Follow-Through program (i.e., were not receiving compensatory
education), they found an average loss of approximately 6 percentile points each year from
the spring of the first grade to the spring of the third grade in reading; in math, there was a
6 point drop from the first to the second grade, and no further drop at the end of third
grade. There were different patterns of score changes for white and minority group students,
but the direction of changes varied across the grades examined. The authors concluded
that "[u]se of the norms based on the standardization group will lead to an expected
posttest score that will be too high for students ordinarily in compensatory programs, especially
minority students who have pretest scores that are not extremely low" and "[f]or
pupils with extremely low pretest scores, the equal percentile assumption leads to a
predicted standard score that is much lower than what was observed." (Kaskowitz and Norwood,
1977, p. 55) In other words, depending on which students you have, your expected
posttest score can be either too high or too low.

Tallmadge (1982) used Sustaining Effects Study (SES) and California Achievement Test 
(CAT) national norming data to assess the norm-referenced evaluation methodology. He 
provided project level (i.e., school within grade) data from both the SES and CAT, as 
shown in Table 1, which provided evidence that in the aggregate — that is, across many 
projects — the norm-referenced design provided reasonable estimates of program impact. 
The SES data provided information on schools which did not have Title I or other compensatory
education programs, while the CAT data base in all likelihood included some students
who were receiving either Title I or other compensatory services. An examination of
the data from both sources shows that, on average, the equi-percentile assumption is more 
or less valid. The CAT data, which most likely include remedial students, show slight posi- 
tive biases (which may not be biases at all, but actual program effects) and the SES data, 
which do not include compensatory education students, show overall gains near zero, but 
with grade differences, particularly at Grade 2, where students without compensatory ser- 
vices actually lost ground relative to the norm. 
[...] projects experiencing a loss in percentile standing from pretest to [...]
overall (that is, at the national level) the equipercentile assumption [...]
measurement contains error, some positive, some negative [...] smaller standard
deviations, but, as a [...] 20 students will be required before real treatment
[...] discriminated from [...] students' total school experience. If the intervention
[...] the case with [...] it is [...] to [...] of the intervention from that of the [...]
program. [...] a superior school produced above-average growth rates, th[...]
growth rates, an effective Title I project would be made to appear less effective."

[...] of gains" which was "evidence of the inappropriateness of
the equipercentile assumption." Tallmadge (1985) responded by reinterpret[...]
"acceptably accurate" no-treatment expectations for low achieving
students in large, medium, and small LEAs of varying size and urbanicity.



Table 1

Estimates of Project Effectiveness Using the Norm-Referenced Model
Project Data Presented by Tallmadge, 1982

                                          Number of    Mean     Standard
                                          Projects     NCE      Deviation

Sustaining Effects Study Data Base
  Grade 2                                    103       -.95       5.36
  Grade 4                                     99        .48       3.89
  Grade 6                                     97        .19       3.14

California Achievement Test Data Base
  Grade 2                                    102       1.79       5.56
  Grade 4                                    109        .80       3.53
  Grade 6                                    109       1.49       3.85

For the most part, however, once it was clear that the system was not going to go away, 
State and local education agency (LEA) personnel implemented the models without much 
more ado, and the Department of Education began producing annual reports on Chapter 1 
participation and effectiveness which used the data submitted by the States. Certain pat- 
terns in the data were noted. First, as had been noted in some prior studies, the estimates 
of student achievement based on fall-to-spring gains were considerably higher than those 
based on annual testing, and they did not seem to be sustained over time. That is, if one 
tested the same students the following fall, the estimate of their achievement gain was con- 
siderably lower than the estimate based on the spring posttest score. The losses were 
variously attributed to summer forgetting, inaccurate norms, stake-holder bias, and other 
causes. Whatever the cause, however, the gains obtained with fall-to-spring testing were not 
a good measure of actual student achievement over a full year. Second, gains tended to be 
higher in the lower grades than in the higher grades. Whether this was due to differences in 
test norms, to needier students being served in the higher grades (who might be more dif- 
ficult to help), or to the greater effectiveness of intervention at the lower grades was never 
thoroughly investigated. Third, mathematics projects tended to show higher gains than read- 
ing projects. Again, this may have been due to differences in norms, differences in the types 
of students served (the math students tended to have higher pretest percentiles than the 
reading students), or real differences in project effectiveness for reading and mathematics. 
The practical effect of these differences was that schools and districts which concentrated 
services in the lower grades and in mathematics projects, and which tested fall-to-spring, 
tended to look more successful than did schools which had reading projects, served students 
in the higher grades, and which used annual testing. For this reason, achievement data 
were kept separate by grade level, testing cycle, and subject matter. 

Requirements Today 

School districts today still are required to use the models to evaluate their projects. In fact, 
use of the models has been expanded beyond the original intent of providing national
estimates of program effectiveness to use for program improvement.

Section 1019 of Public Law 100-297, which was enacted in 1988, mandates that local education
agencies (LEAs) must conduct evaluations at least once every three years in accordance
with the national standards developed under Section 1435. Essentially, this means
that they must use nationally normed tests, or tests equated to nationally normed tests, to
measure student achievement. The law now requires them to measure student achievement
in both basic and more advanced skills. The "advanced skills" measurement in practical application
means that in reading they must give a reading comprehension subtest and in
math, a math problems and applications subtest, and report on these separately. The use of
the evaluation models has been potentially expanded by Section 1021, which deals with
school program improvement. Any school which "does not show substantial progress"
towards meeting its program goals or "shows no improvement or a decline in aggregate performance
of children served under this chapter for one school year as assessed by measures
developed pursuant to section 1019" must develop a school program improvement plan
which includes technical assistance, alternative curriculum, improved coordination with the
regular classroom program, evaluation of parent involvement, and inservice training.

The regulations (Federal Register, 1989) and Policy Manual (1990) provide further 
guidance. LEAs must evaluate their programs at least once every three years, and, for their 
reading, mathematics, and language arts programs in grades 2 and above, they must use 
norm-referenced tests (or tests equated to nationally normed tests) to provide an estimate 
of what their achievement would have been without the Chapter 1 program, give the pretest 
and posttest at least 12 months apart, and calculate the estimate of gain using the NCE 
metric. Certain positive changes in requirements are to be found. One is that the fall-to- 
spring evaluation cycle has been eliminated. Projects must use either a spring-to-spring or 
a fall-to-fall testing cycle to measure project impact. The Policy Manual also emphasizes 
that LEAs must conduct evaluations based on both aggregated student achievement and 
other desired outcomes: for example, success in the regular classroom and attaining grade 
level proficiency. 

All schools that serve 10 or more students in their Chapter 1 programs (across grade levels 
and subjects) must develop a school improvement plan. As part of the plan, achievement 
data are aggregated by subject area for grades 2 and above in each school building. 
(Programs at the pre-kindergarten, kindergarten, and grade 1 level are exempted from 
using normed achievement tests for evaluation due to concerns about test reliability and 
validity for children at those levels.) Each school must look at its pretest to posttest change 
in achievement standing, and "No gain or a decline in aggregate performance scores in the 
subject that is the primary focus of the Chapter 1 program, as measured according to the na- 
tional standards for evaluation, causes a school to be identified for program improvement." 
(Policy Manual, page 154) Schools are not allowed to place confidence bands around the 
estimates of performance to determine whether the gains or losses are statistically sig- 
nificant. Testing for statistical significance is considered to be "an effort to avoid program 
improvement", not an attempt to determine the meaningfulness of the scores. Schools 
which believe that a few extreme scores may have influenced their gains may use the
median rather than the mean (but only if all other schools in the district also use the
median), and schools worried about measurement error are encouraged to use multiple
measures to assess individual student progress.
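
To see why a rule without confidence bands matters for small schools, consider how often an observed aggregate gain would be zero or negative purely by chance. The sketch below assumes observed school gains are normally distributed around the true gain; the function name and the example values (a true gain of 0 or 2 NCEs, a standard deviation of 4 NCEs, loosely in the range of the project-level standard deviations in Table 1) are hypothetical illustrations, not figures from the regulations.

```python
from statistics import NormalDist


def chance_of_identification(true_gain, gain_sd):
    """Probability that the observed aggregate gain is zero or negative.

    Assumes (for illustration only) that observed gains are normally
    distributed around true_gain with standard deviation gain_sd.
    """
    return NormalDist(mu=true_gain, sigma=gain_sd).cdf(0.0)


# A school whose true gain is exactly zero, and one with a true gain of 2 NCEs.
print(round(chance_of_identification(0.0, 4.0), 2))
print(round(chance_of_identification(2.0, 4.0), 2))
```

Under these assumptions a school with no true change is identified half the time, and a school with a real gain of 2 NCEs is still identified about 31 percent of the time.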

In the 1987-88 school year, over one million Chapter 1 reading students and over 625,000
mathematics students in Grades 2 through 12 had both pre- and posttest scores. These students
represented (some States sample school districts) over 1.6 million reading students
and nearly 975,000 math students: a large number, but less than two-thirds of the reading
and math students at these grades (Sinclair and Guttman, 1990). The estimated percentages
of Chapter 1 students for whom achievement data were provided ranged from 74 percent
for Grade 3 reading to only 14 percent for Grade 12 reading.
A Brief Review of Test Norms and Norm-Referenced Tests 

In order to understand what confidence one can place on standardized test scores and the
interpretation of norm-referenced test results, it helps to keep in mind several conditions
which influence test scores, including the effect of measurement error, the effects of "high
stakes" testing, and the relevance of the norm group as a comparison for the students being
assessed.

Measurement Error 

Most people realize that students' test scores are just estimates of their performance and
that the scores will vary from testing to testing due to chance. However, not everyone realizes
just how much scores are likely to vary. How likely is it that a student's score might vary
by several raw score points by chance alone? It is very likely. The standard error of measurement
on the ITBS Level 9 Form J for grade 3 students is 2.8 raw score points for the reading
comprehension subtest and 1.94 raw score points for the math problems subtest. For the
MAT-6, Elementary Level, Form L, grade 3 students have a standard error of measurement
of 3.1 raw score points on the reading comprehension subtest and 2.2 on the math problem
solving subtest. Thus, about one-third of the time, a student's "true" score could be at least
3 points higher or lower on the MAT and ITBS reading comprehension subtests and about
2 points higher or lower on both math problems subtests. Obviously, if a project has large
numbers of students, measurement errors of this sort balance out. However, for projects
with very few students, this will not necessarily be the case, and observed gains or losses
may be due as much to measurement error as to program quality (or lack thereof).
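
The arithmetic behind these statements can be sketched as follows, assuming measurement errors are normally distributed and independent across students. Both assumptions are made here for illustration, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def chance_outside_band(n_sems=1.0):
    """Probability that an error exceeds n_sems standard errors in either direction."""
    return 2 * (1 - NormalDist().cdf(n_sems))


def sem_of_project_mean(sem, n_students):
    """Standard error of a project's mean raw score due to measurement error alone."""
    return sem / math.sqrt(n_students)


print(round(chance_outside_band(), 2))
for n in (5, 20, 100):
    print(n, round(sem_of_project_mean(3.1, n), 2))
```

The first value is about 0.32, the "one-third of the time" in the text. With a standard error of measurement of 3.1 raw score points, the error in a project mean is still about 1.4 points with 5 students but falls to about 0.3 points with 100 students.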

Furthermore, many people do not realize what difference a few items can make in a
student's percentile rank on a norm-referenced test, although many authors have commented
on this. Shepard (1989) examined both the California Achievement Test (CAT)
and the Stanford Achievement Test (SAT) and determined that a gain of one raw score
point at the median in reading, language, and mathematics could translate into a percentile
gain of from 2 to 4 points. I examined the Iowa Tests of Basic Skills (ITBS) and the
Metropolitan Achievement Test (MAT) Reading Comprehension and Mathematics
Problems subtests appropriate for students in grade 3, and found a similar situation, although
the actual gain varies by test and percentile standing. It is obvious that one raw score
point must translate into more than one percentile point gain when you look at the number
of items on the tests: given that most subtests have fewer than 99 items, a gain of more than
one percentile point for each raw score point is inevitable. What is less clear is what effect
this may have at different points on the score distribution.

On the ITBS Form J Reading Comprehension subtest, which has 42 items, one additional 
item correct for a grade 3 student assessed in the spring results in a percentile change of 
from 0 to 5. (See Figure 1.) The NCE gains range from 0 to 9. A student going from 12 to 
13 items correct would move from the 9th to the 11th percentile (the 22nd to the 24th 
NCE), but a student going from 13 to 14 items correct would move from the 11th to the 
Figure 1

Effect of Raw Score Changes on Percentile Standing - Reading Comprehension

Figure 2

Effect of Raw Score Changes on Percentile Standing - Math Problem Solving

[Two panels of raw score to percentile conversions, spring of grade 3: MAT-6, Form L,
Math Problem Solving, Elementary Level; and ITBS Form J, Math Problems, Level 9.
Horizontal axis: raw score. Vertical axis: percentile.]




The effect of measurement error on a student's percentile standing can be seen in Figure 3.
As noted above, the standard error of measurement on the MAT-6, Form L, Reading Comprehension
subtest for grade 3 students is 3.1 raw score points. For a grade 3 student who
obtains a raw score of 27 on the MAT reading comprehension, there is a one-third chance
that her "true" percentile standing is less than 15 or greater than 25, a wide range indeed.

The practical effect for projects which are relying on just one score to assess achievement is
that instruction can appear to have helped (or to have hurt) students when the observed
change is due not to the instruction but to error. As the number of students increases, the
errors will balance out, but for small projects, there can be large observed changes from 
chance alone. 
Figure 3

Confidence Bands for Individual Student Scores: 
Bands are for Plus or Minus One Standard Error of Measurement 
Effect of "High Stakes" Testing 

Measurement error, of course, is not the only factor which can affect students' scores in
ways that can affect interpretation of project impact when using standardized tests. There
have been numerous complaints about the effects of "high stakes" testing (see Herman,
Golan, and Dreyfus, 1990, for one review). Herman et al. note that in the past, tests were
not expected to alter curriculum and instruction, but with the advent of "high stakes" testing
(e.g., situations where teachers, schools, and districts are rated or ranked based on
achievement test scores), teachers are more likely to focus their instruction on those areas
measured by the test. While some might find this positive (after all, if the test measures
important areas, it makes sense to teach those areas), one should recognize that tests
generally measure a fairly narrow set of objectives. In a survey of 85 teachers, Herman
et al. found that low socio-economic elementary schools give the most attention to test
results, and the reaction is to engage in additional test preparation activities. They found
that elementary school teachers spend the equivalent of several weeks on test taking
strategies, with teachers with decreasing test scores engaging more often in these activities.
Furthermore, there was a definite Chapter 1 effect: schools with larger numbers of Chapter 1
students were more likely to feel pressure to raise test scores and to spend more time
not only trying to cover all of the curriculum but also on test preparation.
Shepard (1989) investigated possible reasons for inflated test score gains; many of her findings
are relevant to problems which may occur in Chapter 1 program evaluation. She noted
that:

"Tests are selected to achieve the best possible match between the test and 
the curriculum. ... Once the test is chosen that best fits the curriculum, the 
practical curriculum is adjusted further in response to the test."

As noted above, this is not necessarily bad— assuming that one selects a test to measure 
one's goals, focusing instruction towards those goals, and therefore to the test, makes sense. 
But, Shepard notes that: 

"Narrowing of curriculum does, however, alter the meaning of normative
comparisons. The original standardization sample did not have the benefit of
such focused instruction. Students in the norming sample were apparently
learning the tested content and other things as well when they took the
unannounced test."

In addition, children in grades 1 through 3, and sometimes grade 4 as well, take a practice
test first, unlike children in the norm sample, which further restricts one's ability to inter- 
pret the students' normative standing. Shepard concludes by stating that: 

"In this study we have been concerned primarily with what test-curriculum
alignment and teaching to the test might do to the meaning of scores. There 
is ample evidence here and elsewhere, however, that these practices harm in- 
struction and learning as well." 

Given the difference that a few raw score points can make in students' test scores, and the 
pressure to produce project test score gains, it seems likely that Chapter 1 teachers would 
be tempted to focus their instruction on the narrow areas measured by the tests. In addi- 
tion, inadvertent focus on specific vocabulary or topics which the teachers know are covered 
on the particular test that they are using can easily result in test score improvements which 
invalidate normative comparisons.

The practical effect is that teachers who do not narrow their curriculum to the areas 
covered by the test are at a disadvantage when compared to those teachers who do. If 
project (and teacher) success is measured by test score gains, more teachers are going to
feel pressure to "teach to the test" in order to improve their students' scores. Unfortunate- 
ly, these score improvements will not necessarily mean that the students are receiving im- 
proved instruction and curriculum. 

Relevance of National Norms as a Comparison Group



The underlying rationale of the norm-referenced model is that the national norms provide a
valid comparison group. If we are comparing projects nationwide with the national norms
(the original intent of the model), this may hold. However, for individual projects, there may be less
reason to assume that such comparison is valid. There is an implicit assumption that the
principal factor in assuring that the comparison group is valid is the pretest score. However,
the students at, say, the 15th percentile in the national norm group come from a variety of 
schools and communities. Some of the students who score at the 15th percentile are in 
schools which have very high concentrations of poor children, limited resources, and poor
regular programs. Others are not. If the original premise of Chapter 1 is correct (that is,
that schools with high concentrations of poor children have special problems), then the
Chapter 1 teachers in these schools are being held to a more difficult standard than are the
Chapter 1 teachers in less poor schools. At the reverse end of the scale, many Chapter 1
programs are in schools and districts which have a very low percentage of poor children.
The Chapter 1 projects in these districts may have an "easier" comparison. The "Practical
Guide" (1975) noted that one evaluation hazard to be avoided was the "use of non-com- 
parable treatment and comparison groups" and noted that project personnel should look 
not just at pretest standing but also at differences in age, sex, race, or socio-economic status. 
While test publishers today are paying much more attention to ensuring that tests are free 
of sex and race biases, there are still different achievement patterns for students in high and 
low poverty areas. How much difference can this make? One clue comes from looking at
the norms for schools in low and high socio-economic areas compared with the national
figures on the ITBS Reading Comprehension subtest, which has 1985 norms for all three
groups. As shown in Figure 4, the percentile equivalents for different raw scores differ
considerably, which is not a surprising finding given what is known about the relationship of
school poverty and achievement.

Figure 4

Differences Between Low Socio-economic Status and High Socio-economic
Status Schools on the Iowa Test of Basic Skills

[Figure: ITBS Reading Comprehension, Grade 3. Raw Score to Percentile Conversions
for Different Norming Groups]



The actual impact of raw score point changes on percentile changes may be less apparent,
and is illustrated in Figure 5. At the low end of the percentile range, each additional raw
score point results in a larger percentile change on the low SES norms than on the national
norms. For example, for a student to move from the 19th to the 25th percentile on the
national norms takes slightly over 2 additional raw score points (from 15 to 17+). However,
only one additional raw score point on the low SES norms (from 12 to 13) raises a
student from the 19th to the 25th percentile on the low SES norms.

While it is essential that we maintain as a program goal that students in areas which are
heavily impacted by poverty be provided with instruction and other support which enables 
them to catch up with children from other areas, we need to be equally aware of the difficul- 
ties that their teachers face, and measure the teachers' effectiveness against reasonable 
comparisons. 
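
The raw score arithmetic in the example above can be sketched in a few lines. This is
only an illustration built from the four figures quoted in the text, not a reproduction
of the ITBS norms tables. The names NORMS and percentile_points_per_raw_point are
hypothetical, and "17+" is treated as exactly 17, so the national figure is a slight
overestimate.

```python
# Figures quoted in the text (ITBS Reading Comprehension, Grade 3, 1985 norms).
# Each entry: (raw score at the 19th percentile, raw score at the 25th percentile).
# ASSUMPTION: the text's "17+" is treated as 17.0, so the national rate below
# is an upper bound on the true rate.
NORMS = {
    "national": (15.0, 17.0),
    "low_ses": (12.0, 13.0),
}


def percentile_points_per_raw_point(raw_low, raw_high, pct_low=19, pct_high=25):
    """Average percentile change per raw score point between two table entries."""
    return (pct_high - pct_low) / (raw_high - raw_low)


rates = {
    name: percentile_points_per_raw_point(*raws) for name, raws in NORMS.items()
}

for name, rate in rates.items():
    print(f"{name}: about {rate:.1f} percentile points per raw score point")
```

Under these assumptions, a single raw score point is worth roughly twice as many
percentile points on the low SES norms as on the national norms in this range.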

One additional problem that occurs today is that many students in the national norming 
samples for tests are actually receiving Chapter 1 or equivalent instruction, particularly at 
the lower grade levels. Nationwide, about 20 percent of children in grades 1 through 3 
receive Chapter 1 services, although not all of them receive reading or math. To the extent
that the norms include students served by remedial programs, comparison with the norms is
not comparison with a no-treatment expectation but rather with an alternative treatment. 



Figure 5

ITBS 1985 Norms for Schools in Low SES Areas and for the Nation
Grade 3, 1985 Spring Norms

Conclusions



The Chapter 1 (formerly, Title I) evaluation and reporting system was developed to provide
a national estimate of the effectiveness of the Chapter 1 program; that is, to provide an
estimate of how much more students learned with the program than they would have learned
without it. The system is now used not only to provide an estimate of national effectiveness
but also to estimate effectiveness at the project level. It is time to consider revising this
system, and to provide alternatives for measuring both project impact and national
effectiveness.

Information on the national program effectiveness of Chapter 1 could be better provided by
national samples and special purpose studies which collected more detailed information on
samples of students rather than by forcing nearly all districts to test their students every
year. The results of the current system provide limited information for small projects, and
despite the large numbers of students tested (over one million in reading and over 625,000
in mathematics during the 1987-88 school year), these students represent less than
two-thirds of the students in the program at Grades 2 through 12, and therefore the results may
not even be a good representation of national program effectiveness. Furthermore, there is
a danger that "high-stakes" testing may be causing teachers to narrow their curriculum to
the limited areas covered by the test, as well as inadvertently teach to specific areas of the
test. Both of these procedures invalidate comparison with the national norms, and the
former may mean that students are provided with worse, not better, instruction.

Until alternatives are implemented, we need to take steps to minimize misinterpretation of
the results of Chapter 1 evaluations. These include re-instituting the requirement to test
changes in scores for statistical significance and adding confidence bands to our estimates of
individual project effectiveness. This is especially important for small projects. We also must
make it clearer that no decisions on Chapter 1 projects should be made based on only one
piece of data, and work with project staff to provide meaningful alternatives, including
assisting them with designing evaluation systems which assess student changes over longer
periods of time, provide for alternative measurements of student progress, and add
assessments of the instructional program which provide teachers with valid information about
how to modify their programs in ways which will actually improve student achievement.
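
To suggest why confidence bands matter most for small projects, the following is a
minimal sketch and not a procedure prescribed by the evaluation system. The function
name gain_confidence_band is hypothetical, and the mean gain, standard deviation, and
project sizes are invented for illustration only. It uses a normal approximation; a
t-based band would be somewhat wider for a project of this size.

```python
import math
from statistics import NormalDist


def gain_confidence_band(mean_gain, sd_gain, n_students, level=0.95):
    """Return (lower, upper) bounds for a mean gain, using a normal approximation."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sd_gain / math.sqrt(n_students)
    return mean_gain - half_width, mean_gain + half_width


# HYPOTHETICAL values: the same mean gain and spread, but different project sizes.
small = gain_confidence_band(mean_gain=3.0, sd_gain=10.0, n_students=15)
large = gain_confidence_band(mean_gain=3.0, sd_gain=10.0, n_students=1500)

print(f"15 students:   {small[0]:.1f} to {small[1]:.1f}")
print(f"1500 students: {large[0]:.1f} to {large[1]:.1f}")
```

With these illustrative numbers, the band for the small project includes zero, so the
same observed gain that looks clearly positive in a large aggregate cannot be
distinguished from no gain at the level of a single small project.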